San Francisco crime data

Exploration of temporal and spatial patterns in crime data from San Francisco, followed by applying and comparing two classification algorithms, K nearest neighbors and Random Forest, for predicting crime type.

Import libraries

1 Data Collection

No collection needed.

2 Preparation

2.1 Load and merge data sources

2.2 Missing values and data types

Check for missing values and replace missing districts with 'Unknown'. Convert the time variables to datetime type.
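A minimal sketch of this step, assuming pandas and a small made-up table. The column names `district`, `date` and `time`, and the date format, are hypothetical and may not match the real data:

```python
import pandas as pd

# Toy data; column names and the date format are assumptions for illustration.
df = pd.DataFrame({
    "district": ["MISSION", None, "CENTRAL"],
    "date": ["01/15/2017", "06/30/2016", "12/24/2015"],
    "time": ["23:45", "08:10", "14:05"],
})

# Count missing values per column before fixing anything.
missing_before = df.isna().sum()

# Replace missing districts with the label 'Unknown'.
df["district"] = df["district"].fillna("Unknown")

# Combine the date and time strings and parse them into one datetime column.
df["datetime"] = pd.to_datetime(df["date"] + " " + df["time"], format="%m/%d/%Y %H:%M")
```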

2.3 Transformations

Adding time variables
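One way to derive such time variables, sketched with the pandas `.dt` accessor. The column name `datetime` and the chosen set of derived variables are assumptions:

```python
import pandas as pd

# Toy data; the "datetime" column name is hypothetical.
df = pd.DataFrame({"datetime": pd.to_datetime(["2017-01-15 23:45", "2016-06-30 08:10"])})

# Derive calendar features used later for the temporal plots.
df["year"] = df["datetime"].dt.year
df["month"] = df["datetime"].dt.month
df["weekday"] = df["datetime"].dt.day_name()
df["hour"] = df["datetime"].dt.hour
```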

3 Exploration

Exploring overall counts as well as temporal and spatial trends for the crimes

3.1 Overall count of crimes

Define function to create barchart
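A possible shape for such a helper, assuming matplotlib and a pandas Series of counts. The function name `plot_bar` and its parameters are hypothetical:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

def plot_bar(counts, title, xlabel="Count"):
    """Draw a horizontal bar chart of counts with the largest bar at the top."""
    ordered = counts.sort_values(ascending=True)
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.barh(ordered.index.astype(str), ordered.values)
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    fig.tight_layout()
    return fig, ax

# Toy data standing in for the crime category column.
crimes = pd.Series(["VANDALISM", "ASSAULT", "VANDALISM", "DRUG/NARCOTIC", "VANDALISM", "ASSAULT"])
fig, ax = plot_bar(crimes.value_counts(), "Overall count of crimes")
```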

Make plots

Choose focus crimes

For simplicity we already narrow down to six different crime types:

3.2 Temporal patterns

Explore the temporal trends for different crime categories

Define function to create subplots
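A sketch of what such a helper could look like: one subplot per crime category, each showing counts over a chosen time variable. The function name `plot_by_category` and the column names are assumptions:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

def plot_by_category(df, time_col, category_col="category", ncols=3):
    """Draw one bar chart per category with counts over `time_col`."""
    categories = sorted(df[category_col].unique())
    nrows = -(-len(categories) // ncols)  # ceiling division
    fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3 * nrows), squeeze=False)
    flat_axes = axes.ravel()
    for ax, cat in zip(flat_axes, categories):
        counts = df.loc[df[category_col] == cat, time_col].value_counts().sort_index()
        ax.bar(counts.index.astype(str), counts.values)
        ax.set_title(cat)
        ax.set_xlabel(time_col)
    # Hide any unused axes in the grid.
    for ax in flat_axes[len(categories):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig, axes

# Toy data with a hypothetical "hour" variable.
toy = pd.DataFrame({
    "category": ["VANDALISM", "VANDALISM", "DRUG/NARCOTIC", "DRUG/NARCOTIC"],
    "hour": [22, 23, 14, 14],
})
fig, axes = plot_by_category(toy, "hour")
```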

Yearly development

Year 2018 is excluded because the data covers only half of that year.
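The filtering and yearly counting can be sketched as follows, assuming pandas and hypothetical `year` and `category` columns on made-up rows:

```python
import pandas as pd

# Toy data; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "year": [2016, 2016, 2017, 2017, 2018],
    "category": ["VANDALISM", "DRUG/NARCOTIC", "VANDALISM", "VANDALISM", "DRUG/NARCOTIC"],
})

# Drop the incomplete year before comparing yearly totals.
complete_years = df[df["year"] < 2018]

# Count crimes per year and category; missing combinations become 0.
yearly = complete_years.groupby(["year", "category"]).size().unstack(fill_value=0)
```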

Monthly trends

Weekly trends

24 hour cycle trends

3.3 Spatial patterns

Explore trends for location of crimes

Note: the data covers only 3 districts. Ideally, coordinates would be mapped within all districts of San Francisco.

Distribution of crimes across districts

Map locations of chosen crimes

The maps focus on only two crime types in 2017.

Spread of coordinates

Only vandalism and drug/narcotic

4 Modelling and Evaluation

Classification predictive modeling

Supervised machine learning, specifically classification, is used to predict which category of crime is most likely to take place at a given time and place in San Francisco.

Binary classification is done with the following methods:

  1. Baseline model
  2. K nearest neighbors
  3. Random Forest

Predict: Crime type - vandalism or drug/narcotic

Variables

4.1 Data preparation

A balanced data set is preferred, so two categories with almost the same number of observations are chosen.
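One possible way to check how balanced a pair of categories is, sketched with the standard library only. The counts below are made up for illustration, and the helper `balance` is hypothetical:

```python
from collections import Counter
from itertools import combinations

# Made-up labels; the real category counts will differ.
labels = (["VANDALISM"] * 98 + ["DRUG/NARCOTIC"] * 100
          + ["ASSAULT"] * 250 + ["BURGLARY"] * 40)
counts = Counter(labels)

def balance(a, b):
    """Ratio of the smaller to the larger class size (1.0 means perfectly balanced)."""
    return min(counts[a], counts[b]) / max(counts[a], counts[b])

# Pick the pair of categories whose sizes are closest to each other.
best_pair = max(combinations(counts, 2), key=lambda pair: balance(*pair))
```

A ratio close to 1 means that accuracy is a meaningful metric, since always guessing one class would score only about 50%.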

The choice is:

Training and test set

80/20-split
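An 80/20 split can be sketched with scikit-learn's `train_test_split`. The synthetic features, the stratification and the fixed `random_state` are assumptions added for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and the binary crime label.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0, 1] * 50)

# Hold out 20% for testing; stratify keeps the class ratio equal in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```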

Validation function
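The source does not show what the validation function computes. As one plausible sketch, assuming k-fold cross-validated accuracy, a hypothetical helper `validate` could look like this:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def validate(model, X, y, folds=5):
    """Return mean and standard deviation of cross-validated accuracy."""
    scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
    return scores.mean(), scores.std()

# Synthetic binary data standing in for the training set.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
mean_acc, std_acc = validate(KNeighborsClassifier(n_neighbors=5), X, y)
```

Running the same helper on every model keeps the comparison between them consistent.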

Baseline model
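The kind of baseline is not specified in the text. A common choice, assumed here, is to always predict the most frequent class, which scikit-learn provides as `DummyClassifier`:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels; the features are ignored by this baseline.
X_train = np.zeros((10, 1))
y_train = np.array(["VANDALISM"] * 6 + ["DRUG/NARCOTIC"] * 4)
X_test = np.zeros((4, 1))
y_test = np.array(["VANDALISM", "DRUG/NARCOTIC", "VANDALISM", "VANDALISM"])

# Always predict the majority class seen during training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_predictions = baseline.predict(X_test)
baseline_accuracy = baseline.score(X_test, y_test)
```

Any real model should beat this accuracy to be considered useful.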

K Nearest Neighbors
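A minimal sketch of a K nearest neighbors classifier with scikit-learn, on synthetic, well-separated data. The feature scaling step and `n_neighbors=5` are assumptions, not settings taken from the source:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two synthetic clusters standing in for the two crime types.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(10, 1, size=(100, 2))])
y = np.array(["VANDALISM"] * 100 + ["DRUG/NARCOTIC"] * 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scale features first, since KNN relies on distances between points.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
knn_accuracy = knn.score(X_test, y_test)
```

Scaling matters when features such as hour of day and coordinates are on very different ranges.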

Random Forest
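A comparable sketch for a Random Forest with scikit-learn, again on synthetic data. The hyperparameters shown are assumptions rather than the notebook's settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the prepared crime features.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit an ensemble of decision trees; random_state makes the run reproducible.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
forest_accuracy = forest.score(X_test, y_test)

# Relative contribution of each feature to the splits, summing to 1.
importances = forest.feature_importances_
```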

Evaluation
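Evaluation of a binary classifier can be sketched with accuracy and a confusion matrix from scikit-learn. The labels below are made up and do not represent results from the notebook:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels for illustration only.
y_true = ["VANDALISM", "VANDALISM", "VANDALISM",
          "DRUG/NARCOTIC", "DRUG/NARCOTIC", "DRUG/NARCOTIC"]
y_pred = ["VANDALISM", "VANDALISM", "DRUG/NARCOTIC",
          "DRUG/NARCOTIC", "DRUG/NARCOTIC", "VANDALISM"]

# Share of correct predictions.
accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, in the given label order.
matrix = confusion_matrix(y_true, y_pred, labels=["VANDALISM", "DRUG/NARCOTIC"])
```

The confusion matrix shows which of the two crime types is misclassified more often, which accuracy alone hides.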

Predictions